Biodiversity in Melbourne's parks
Authored by: Francis Rusli
Date: April 2024
Duration: 90 mins
Level: Intermediate
Prerequisite Skills: Python, basic machine learning, optional Google Colab access
Dataset 1: Bat Records in Fitzroy Gardens and Royal Botanic Gardens 2010
Dataset Link
Metadata Link

Dataset 2: Butterfly biodiversity survey 2017
Dataset Link
Metadata Link

Dataset 3: Bioblitz 2016
Dataset Link
Metadata Link

Project Objective, Overview & Research¶

User Story
"As a city council member, I want to study and monitor the biodiversity in Melbourne's parks and green spaces, focusing on bats and butterflies while also incorporating other animal species. This information can empower the City of Melbourne (CoM) to plan conservation, address environmental challenges, and foster the natural habitats inside our city limits."

Objective
The 'Urban Biodiversity Monitoring' project is a six-week investigation of the parks and green spaces of Melbourne. It is specifically tailored to study and monitor the populations of bats and butterflies, two critical indicators of ecological health in urban environments.

Part 1 includes set up, fetching or loading datasets, pre-processing, data cleaning, saving datasets, and merging datasets.

Part 2 contains an overview analysis of data structures, population spread and biodiversity of species.

Overview
The biodiversity within the City of Melbourne's parks and green spaces, especially the populations of bats and butterflies, can significantly reflect the health and balance of our urban ecosystem. This project aims to investigate the population spread, habitat viability, and the interaction of these species with urban development.

Benefits
  1. Environmental Health and Ecosystem Balance Assessment: By monitoring the populations of bats and butterflies, we can gain insights into the overall health of the ecosystem in Melbourne's parks. These species are often considered indicator species[1], meaning changes in their populations can signal changes in the broader environmental conditions. For instance, bats play a crucial role in controlling insect populations and pollinating plants[2], while butterflies are indicators of ecological diversity and health. Studying these populations helps in understanding the impacts of urban development on natural habitats and the effectiveness of current conservation strategies.
  2. Informed Conservation and Urban Planning: The data gathered will be crucial in guiding conservation efforts and urban planning decisions. Understanding where and how these species thrive can inform the development of green spaces that support biodiversity. This can lead to the creation of urban environments that are not only beneficial for wildlife but also enhance the quality of life for city residents. For example, identifying key habitats and migration patterns of these species can aid in designing parks and green corridors that promote their conservation.
  3. Public Engagement and Education: This research can also play a significant role in public education and engagement. By raising awareness about the importance of biodiversity in urban areas, we can foster a greater appreciation and understanding among the public. This can lead to increased community involvement in conservation efforts and sustainable practices[5]. Moreover, it can help promote citizen science initiatives, where locals contribute to data collection and monitoring, further expanding the scope and impact of the research.
Research
This research project focuses on assessing the biodiversity in Melbourne's parks, with a special emphasis on the populations of bats and butterflies. The aim is to analyze how these species interact with urban development and their spread across different locations and times. The findings will be crucial for informing conservation planning and understanding the health of the urban ecosystem.

Part 1 (First 3 Weeks): Utilizing Python and Jupyter Notebook, the initial stage involves setting up, data cleaning, and preprocessing, followed by a basic analysis focusing on species, locations, and times. The outcome will be an initial understanding of the distribution of bats and butterflies in Melbourne's parks.

Part 2 (Next 3 Weeks): This phase dives into detailed trend analysis and mapping. It aims to uncover patterns in the distribution of these species, hypothesize the reasons for their specific locations, and understand the impact of urban environments on them. The project concludes with strategic recommendations for biodiversity conservation and urban planning.

Conclusion
In this analysis, we explored comprehensive datasets concerning bat and butterfly populations within Melbourne's parks. Our focus was on accurately capturing the distribution and prevalence of these species to support ongoing environmental monitoring and conservation initiatives. Through various visualizations, including histograms, pie charts, and interactive maps, we provided a clear and engaging representation of the data. These visualizations highlight the spatial distribution of different species across multiple locations, offering insights into their habitat preferences and population densities.

This study not only aids in understanding the current status of bat and butterfly populations in Melbourne but also serves as a crucial tool for policymakers and conservationists. By identifying trends and potential areas of concern, conservation efforts can be better directed to preserve the biodiversity within these urban green spaces. Future analyses may expand on this by incorporating additional environmental variables, which will further enhance our understanding and facilitate more effective conservation strategies.

References

  • [1] Environmental controls on butterfly occurrence and species richness in Israel: The importance of temperature over rainfall
  • [2] Insectivorous bats provide significant economic value to the Australian cotton industry
  • [3] The effects of light and noise from urban development on biodiversity: Implications for protected areas in Australia
  • [4] Biodiversity in the City: Fundamental Questions for Understanding the Ecology of Urban Green Spaces for Biodiversity Conservation
  • [5] Butterfly conservation in Australia: the importance of community participation
  • Part 1 (Set up & Pre-processing)¶

    • Set Up
    • Pre-processing

    Part 1.1: Set Up¶

    • Import Core Libraries
    • Import Dependencies
    In [58]:
    # Install packages
    !pip install osmnx
    !pip install tqdm
    
    
    In [1]:
    #Import core libraries
    import requests
    import pandas as pd
    import numpy as np
    import os
    
    import seaborn as sns
    import matplotlib.pyplot as plt
    import json
    
    import ipywidgets as widgets
    from ipywidgets import interact
    
    import osmnx as ox
    import geopandas as gpd
    import networkx as nx
    
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score
    
    import warnings
    warnings.filterwarnings("ignore")
    
    In [2]:
    # Define the company colors format for matplotlib
    dark_theme_colors = ['#08af64', '#14a38e', '#0f9295', '#056b8a', '#121212'] #Dark theme
    light_theme_colors = ['#2af598', '#22e4ac', '#1bd7bb', '#14c9cb', '#0fbed8', '#08b3e5'] #Light theme
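One way to apply these palettes notebook-wide is matplotlib's property cycle; a minimal sketch (the palette values are copied from the cell above, everything else is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from cycler import cycler

dark_theme_colors = ['#08af64', '#14a38e', '#0f9295', '#056b8a', '#121212']

# Register the palette as the default property cycle for new Axes
plt.rcParams['axes.prop_cycle'] = cycler(color=dark_theme_colors)

fig, ax = plt.subplots()
lines = [ax.plot([0, 1])[0] for _ in range(3)]
print([ln.get_color() for ln in lines])  # first three palette colors
```

With this in place, later plotting cells pick up the theme without passing `color=` explicitly.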
    
    In [3]:
    def fetch_data(base_url, dataset, api_key, num_records=99, offset=0):
        """Fetch all records for a dataset from the CoM API, paginating until exhausted."""
        all_records = []
        max_offset = 9900  # cap total pagination at 10,000 records
    
        while offset <= max_offset:
            filters = f'{dataset}/records?limit={num_records}&offset={offset}'
            url = f'{base_url}{filters}&api_key={api_key}'
    
            try:
                result = requests.get(url, timeout=10)
                result.raise_for_status()
                records = result.json().get('results')
            except requests.exceptions.RequestException as e:
                raise Exception(f'API request failed: {e}')
    
            if not records:
                break
            all_records.extend(records)
            if len(records) < num_records:
                break  # a short page means the dataset is exhausted
    
            offset += num_records
    
        return pd.DataFrame(all_records)
    
    BASE_URL = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    API_KEY = ''
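The fetch loop stops on an empty page, a short page, or the offset cap. That stopping logic can be exercised against a stub page-fetcher without touching the network; `paginate` and `fake_page` below are illustrative names, not part of the notebook:

```python
def paginate(fetch_page, page_size=99, max_offset=9900):
    """Collect records page by page until a short/empty page or the offset cap."""
    records, offset = [], 0
    while offset <= max_offset:
        page = fetch_page(offset)
        if not page:
            break
        records.extend(page)
        if len(page) < page_size:
            break  # a short page means the data is exhausted
        offset += page_size
    return records

# Stub dataset of 250 records served in pages of up to 99
data = list(range(250))
fake_page = lambda offset: data[offset:offset + 99]
print(len(paginate(fake_page)))  # 250
```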
    

    Part 1.2: Pre-Processing¶

    • Fetch each dataset (via the CoM API) or load from CSV
    • Load data into a dataframe
    • Data cleaning (duplicates, missing values, data types, etc.)
    • Save cleaned dataset
    • Encode Data
    • Save encoded dataset
    • Correlations
    • Merge datasets

    Dataset 1: Bat records in Fitzroy Gardens and Royal Botanic Gardens 2010¶

    • Dataset Identifier: bat-records-in-fitzroy-gardens-and-royal-botanic-gardens-2010

    Dataset Link

    Summary: This dataset provides detailed observations of various bat species within Fitzroy Gardens and Royal Botanic Gardens over the year 2010. It includes taxonomic classification, common names, and exact locations within the parks.

    This dataset includes observations of bats categorized by taxa, genus, and species with specific geo-spatial information, indicating precise observation spots within the parks. The data also lists common names alongside the scientific taxonomy to assist in species identification.

    Note: Each observation is pinpointed with latitude and longitude coordinates, providing exact locations but not the area affected or the size of the bat populations.

    Note: This dataset is crucial for ecological monitoring and conservation efforts, helping track bat populations and distribution within urban parklands.

    In [75]:
    SENSOR_DATASET = 'bat-records-in-fitzroy-gardens-and-royal-botanic-gardens-2010'
    bat = fetch_data(BASE_URL, SENSOR_DATASET, API_KEY)
    bat.head()
    
    Out[75]:
    taxa kingdom phylum class order family genus species common_name park_name location
    0 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE Mormopterus None None Royal Botanic Gardens {'lon': 144.9804, 'lat': -37.8312}
    1 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA VESPERTILIONIDAE Chalinolobus gouldii Gould's Wattled Bat Fitzroy Gardens {'lon': 144.9786, 'lat': -37.8135}
    2 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA VESPERTILIONIDAE Chalinolobus gouldii Gould's Wattled Bat Royal Botanic Gardens {'lon': 144.9804, 'lat': -37.8312}
    3 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE Austronomous australis White-striped Freetail Bat Fitzroy Gardens {'lon': 144.9786, 'lat': -37.8135}
    4 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE Mormopterus None None Fitzroy Gardens {'lon': 144.9786, 'lat': -37.8135}
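The `location` column arrives as Python dicts rather than flat coordinates. Should separate numeric columns be needed for mapping, the dicts can be flattened with `pandas.json_normalize`; the toy rows below mirror the output above and are illustrative only:

```python
import pandas as pd

# Toy rows mirroring the 'location' dicts shown in the output above
bat = pd.DataFrame({
    "park_name": ["Royal Botanic Gardens", "Fitzroy Gardens"],
    "location": [{"lon": 144.9804, "lat": -37.8312},
                 {"lon": 144.9786, "lat": -37.8135}],
})

# Flatten the dict column into numeric latitude/longitude columns
coords = pd.json_normalize(bat["location"].tolist())
bat["latitude"] = coords["lat"]
bat["longitude"] = coords["lon"]
print(bat[["park_name", "latitude", "longitude"]])
```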
    In [76]:
    #View info of data
    bat.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 11 columns):
     #   Column       Non-Null Count  Dtype 
    ---  ------       --------------  ----- 
     0   taxa         10 non-null     object
     1   kingdom      10 non-null     object
     2   phylum       10 non-null     object
     3   class        10 non-null     object
     4   order        10 non-null     object
     5   family       10 non-null     object
     6   genus        10 non-null     object
     7   species      5 non-null      object
     8   common_name  5 non-null      object
     9   park_name    10 non-null     object
     10  location     10 non-null     object
    dtypes: object(11)
    memory usage: 1012.0+ bytes
    
    In [77]:
    # Check missing values for dataset
    missing_values = bat.isnull().sum()
    missing_values # Number of missing values in each column
    
    Out[77]:
    taxa           0
    kingdom        0
    phylum         0
    class          0
    order          0
    family         0
    genus          0
    species        5
    common_name    5
    park_name      0
    location       0
    dtype: int64
    In [78]:
    # Column names to check for missing values
    column_name_species = 'species'
    column_name_common_name = 'common_name'
    
    # Calculate the percentage of missing values
    percentage_missing_species = (missing_values[column_name_species] / len(bat)) * 100
    percentage_missing_common_name = (missing_values[column_name_common_name] / len(bat)) * 100
    
    # Print the results
    print(f"Percentage of missing values for '{column_name_species}': {percentage_missing_species:.2f}%")
    print(f"Percentage of missing values for '{column_name_common_name}': {percentage_missing_common_name:.2f}%")
    
    Percentage of missing values for 'species': 50.00%
    Percentage of missing values for 'common_name': 50.00%
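The per-column percentages above generalize to the whole frame in one step: `isnull().mean()` gives the fraction of nulls per column. A small sketch on an illustrative frame with the same 50% gap:

```python
import pandas as pd

# Illustrative frame with the same 50% species gap as the bat data
df = pd.DataFrame({
    "species": ["gouldii", None, "australis", None],
    "park_name": ["Fitzroy Gardens", "Royal Botanic Gardens",
                  "Fitzroy Gardens", "Royal Botanic Gardens"],
})

# Fraction of nulls per column, scaled to a percentage
pct_missing = df.isnull().mean() * 100
print(pct_missing)
```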
    

    Cleaning bat dataset¶

    In [79]:
    # Treating 'Unknown' entries as missing values if the columns exist
    if 'species' in bat.columns:
        bat['species'] = bat['species'].replace('Unknown', np.nan)
    if 'common_name' in bat.columns:
        bat['common_name'] = bat['common_name'].replace('Unknown', np.nan)
    
    # Displaying the cleaned data information and the first few rows
    print(bat.info())
    print(bat.head())
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 11 columns):
     #   Column       Non-Null Count  Dtype 
    ---  ------       --------------  ----- 
     0   taxa         10 non-null     object
     1   kingdom      10 non-null     object
     2   phylum       10 non-null     object
     3   class        10 non-null     object
     4   order        10 non-null     object
     5   family       10 non-null     object
     6   genus        10 non-null     object
     7   species      5 non-null      object
     8   common_name  5 non-null      object
     9   park_name    10 non-null     object
     10  location     10 non-null     object
    dtypes: object(11)
    memory usage: 1012.0+ bytes
    None
         taxa   kingdom    phylum     class       order            family  \
    0  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA        MOLOSSIDAE   
    1  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA  VESPERTILIONIDAE   
    2  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA  VESPERTILIONIDAE   
    3  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA        MOLOSSIDAE   
    4  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA        MOLOSSIDAE   
    
              genus    species                 common_name              park_name  \
    0   Mormopterus       None                        None  Royal Botanic Gardens   
    1  Chalinolobus    gouldii         Gould's Wattled Bat        Fitzroy Gardens   
    2  Chalinolobus    gouldii         Gould's Wattled Bat  Royal Botanic Gardens   
    3  Austronomous  australis  White-striped Freetail Bat        Fitzroy Gardens   
    4   Mormopterus       None                        None        Fitzroy Gardens   
    
                                 location  
    0  {'lon': 144.9804, 'lat': -37.8312}  
    1  {'lon': 144.9786, 'lat': -37.8135}  
    2  {'lon': 144.9804, 'lat': -37.8312}  
    3  {'lon': 144.9786, 'lat': -37.8135}  
    4  {'lon': 144.9786, 'lat': -37.8135}  
    
    In [80]:
    # Print the column names to verify the correct column exists
    print(bat.columns)
    
    Index(['taxa', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus',
           'species', 'common_name', 'park_name', 'location'],
          dtype='object')
    

    Matching column names for data merge¶

    In [81]:
    # Renaming the 'location' column to 'geopoint'
    bat.rename(columns={'location': 'geopoint'}, inplace=True)
    
    # Displaying the cleaned data information and the first few rows
    print(bat.info())
    print(bat.head())
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 11 columns):
     #   Column       Non-Null Count  Dtype 
    ---  ------       --------------  ----- 
     0   taxa         10 non-null     object
     1   kingdom      10 non-null     object
     2   phylum       10 non-null     object
     3   class        10 non-null     object
     4   order        10 non-null     object
     5   family       10 non-null     object
     6   genus        10 non-null     object
     7   species      5 non-null      object
     8   common_name  5 non-null      object
     9   park_name    10 non-null     object
     10  geopoint     10 non-null     object
    dtypes: object(11)
    memory usage: 1012.0+ bytes
    None
         taxa   kingdom    phylum     class       order            family  \
    0  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA        MOLOSSIDAE   
    1  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA  VESPERTILIONIDAE   
    2  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA  VESPERTILIONIDAE   
    3  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA        MOLOSSIDAE   
    4  Mammal  ANIMALIA  CHORDATA  MAMMALIA  CHIROPTERA        MOLOSSIDAE   
    
              genus    species                 common_name              park_name  \
    0   Mormopterus       None                        None  Royal Botanic Gardens   
    1  Chalinolobus    gouldii         Gould's Wattled Bat        Fitzroy Gardens   
    2  Chalinolobus    gouldii         Gould's Wattled Bat  Royal Botanic Gardens   
    3  Austronomous  australis  White-striped Freetail Bat        Fitzroy Gardens   
    4   Mormopterus       None                        None        Fitzroy Gardens   
    
                                 geopoint  
    0  {'lon': 144.9804, 'lat': -37.8312}  
    1  {'lon': 144.9786, 'lat': -37.8135}  
    2  {'lon': 144.9804, 'lat': -37.8312}  
    3  {'lon': 144.9786, 'lat': -37.8135}  
    4  {'lon': 144.9786, 'lat': -37.8135}  
    

    Dataset 2: Butterfly biodiversity survey 2017¶

    • Dataset Identifier: butterfly-biodiversity-survey-2017

    Dataset Link

    Summary: Comprehensive survey data capturing observations of butterflies across various sites within Melbourne for the year 2017. Details include environmental conditions, vegetation types, and specific butterfly sightings.

    The dataset provides granular details such as temperature, humidity, vegetation details, and wind conditions at the time of each observation, along with precise geographic coordinates. It aims to aid research into butterfly populations and their responses to urban environments.

    Note: Data points include specific environmental conditions and butterfly species observations to better understand the impact of urban settings on biodiversity.

    Note: The dataset is useful for ecological research and conservation planning, providing essential data for studies on biodiversity in urban parks.

    In [4]:
    SENSOR_DATASET = 'butterfly-biodiversity-survey-2017'
    butterfly = fetch_data(BASE_URL, SENSOR_DATASET, API_KEY)
    butterfly.head()
    
    Out[4]:
    site sloc walk date time vegwalktime vegspecies vegfamily lat lon ... tabe brow csem aand jvil paur ogyr gmac datetime location
    0 Womens Peace Gardens 2 1 2017-02-26 0001-01-01T11:42:00+00:00 1.3128 Schinus molle Anacardiaceae -37.7912 144.9244 ... 0 0 0 0 0 0 0 0 2017-02-26T11:42:00+00:00 {'lon': 144.9244, 'lat': -37.7912}
    1 Argyle Square 1 1 2017-11-02 0001-01-01T10:30:00+00:00 0.3051 Rosmarinus officinalis Lamiaceae -37.8023 144.9665 ... 0 0 0 0 0 0 0 0 2017-02-11T10:30:00+00:00 {'lon': 144.9665, 'lat': -37.8023}
    2 Argyle Square 2 1 2017-12-01 0001-01-01T10:35:00+00:00 0.3620 Euphorbia sp. Euphorbiaceae -37.8026 144.9665 ... 0 0 0 0 0 0 0 0 2017-01-12T10:35:00+00:00 {'lon': 144.9665, 'lat': -37.8026}
    3 Westgate Park 4 1 2017-03-03 0001-01-01T11:44:00+00:00 3.1585 Melaleuca lanceolata Myrtaceae -37.8316 144.9089 ... 0 0 0 0 0 0 0 0 2017-03-03T11:44:00+00:00 {'lon': 144.9089, 'lat': -37.8316}
    4 Argyle Square 1 3 2017-01-15 0001-01-01T12:33:00+00:00 0.4432 Mentha sp. Lamiaceae -37.8027 144.9662 ... 0 0 0 0 0 0 0 0 2017-01-15T12:33:00+00:00 {'lon': 144.9662, 'lat': -37.8027}

    5 rows × 42 columns

    In [14]:
    #View info of data
    butterfly.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4056 entries, 0 to 4055
    Data columns (total 42 columns):
     #   Column       Non-Null Count  Dtype  
    ---  ------       --------------  -----  
     0   site         4056 non-null   object 
     1   sloc         4056 non-null   int64  
     2   walk         4056 non-null   int64  
     3   date         4056 non-null   object 
     4   time         4056 non-null   object 
     5   vegwalktime  4052 non-null   float64
     6   vegspecies   4056 non-null   object 
     7   vegfamily    4056 non-null   object 
     8   lat          4056 non-null   float64
     9   lon          4056 non-null   float64
     10  temp         4056 non-null   float64
     11  hum          4056 non-null   float64
     12  win1         4056 non-null   float64
     13  win2         4056 non-null   float64
     14  win3         4056 non-null   float64
     15  win4         4056 non-null   float64
     16  win          4056 non-null   float64
     17  per          4056 non-null   int64  
     18  sur          4056 non-null   int64  
     19  prap         4056 non-null   int64  
     20  vker         4056 non-null   int64  
     21  vite         4056 non-null   int64  
     22  blue         4056 non-null   int64  
     23  dpet         4056 non-null   int64  
     24  dple         4056 non-null   int64  
     25  pana         4056 non-null   int64  
     26  pdem         4056 non-null   int64  
     27  hesp         4056 non-null   int64  
     28  esmi         4056 non-null   int64  
     29  cato         4056 non-null   int64  
     30  gaca         4056 non-null   int64  
     31  belo         4056 non-null   int64  
     32  tabe         4056 non-null   int64  
     33  brow         4056 non-null   int64  
     34  csem         4056 non-null   int64  
     35  aand         4056 non-null   int64  
     36  jvil         4056 non-null   int64  
     37  paur         4056 non-null   int64  
     38  ogyr         4056 non-null   int64  
     39  gmac         4056 non-null   int64  
     40  datetime     4056 non-null   object 
     41  location     4056 non-null   object 
    dtypes: float64(10), int64(25), object(7)
    memory usage: 1.3+ MB
    
    In [5]:
    # Check missing values for dataset
    missing_values = butterfly.isnull().sum()
    missing_values # Number of missing values in each column
    
    Out[5]:
    site           0
    sloc           0
    walk           0
    date           0
    time           0
    vegwalktime    4
    vegspecies     0
    vegfamily      0
    lat            0
    lon            0
    temp           0
    hum            0
    win1           0
    win2           0
    win3           0
    win4           0
    win            0
    per            0
    sur            0
    prap           0
    vker           0
    vite           0
    blue           0
    dpet           0
    dple           0
    pana           0
    pdem           0
    hesp           0
    esmi           0
    cato           0
    gaca           0
    belo           0
    tabe           0
    brow           0
    csem           0
    aand           0
    jvil           0
    paur           0
    ogyr           0
    gmac           0
    datetime       0
    location       0
    dtype: int64

    Matching columns for merging data¶

    In [6]:
    # Renaming columns for consistency
    butterfly = butterfly.rename(columns={
        'date': 'sighting_date',
        'lat': 'latitude',
        'lon': 'longitude',
        'location': 'geopoint'
    })
    
    # Converting 'datetime' column to datetime type and extracting the time part
    butterfly['datetime'] = pd.to_datetime(butterfly['datetime'])
    butterfly['time'] = butterfly['datetime'].dt.time
    
    # Rebuilding 'geopoint' as a 'latitude, longitude' string
    butterfly['geopoint'] = butterfly.apply(
        lambda row: f"{row['latitude']}, {row['longitude']}", axis=1
    )
    
    # Display the first few rows of the cleaned dataset
    butterfly.head()
    
    Out[6]:
    site sloc walk sighting_date time vegwalktime vegspecies vegfamily latitude longitude ... tabe brow csem aand jvil paur ogyr gmac datetime geopoint
    0 Womens Peace Gardens 2 1 2017-02-26 11:42:00 1.3128 Schinus molle Anacardiaceae -37.7912 144.9244 ... 0 0 0 0 0 0 0 0 2017-02-26 11:42:00+00:00 -37.7912, 144.9244
    1 Argyle Square 1 1 2017-11-02 10:30:00 0.3051 Rosmarinus officinalis Lamiaceae -37.8023 144.9665 ... 0 0 0 0 0 0 0 0 2017-02-11 10:30:00+00:00 -37.8023, 144.9665
    2 Argyle Square 2 1 2017-12-01 10:35:00 0.3620 Euphorbia sp. Euphorbiaceae -37.8026 144.9665 ... 0 0 0 0 0 0 0 0 2017-01-12 10:35:00+00:00 -37.8026, 144.9665
    3 Westgate Park 4 1 2017-03-03 11:44:00 3.1585 Melaleuca lanceolata Myrtaceae -37.8316 144.9089 ... 0 0 0 0 0 0 0 0 2017-03-03 11:44:00+00:00 -37.8316, 144.9089
    4 Argyle Square 1 3 2017-01-15 12:33:00 0.4432 Mentha sp. Lamiaceae -37.8027 144.9662 ... 0 0 0 0 0 0 0 0 2017-01-15 12:33:00+00:00 -37.8027, 144.9662

    5 rows × 42 columns
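With the column names aligned, the "merge datasets" step listed earlier can be sketched as a row-wise concatenation that keeps the union of columns. The frames below are toy stand-ins for the cleaned datasets, and the `source` tag is an illustrative addition:

```python
import pandas as pd

# Toy stand-ins for the cleaned bat and butterfly frames
bat = pd.DataFrame({"common_name": ["Gould's Wattled Bat"],
                    "geopoint": ["-37.8135, 144.9786"]})
butterfly = pd.DataFrame({"site": ["Argyle Square"],
                          "geopoint": ["-37.8023, 144.9665"]})

# Tag each frame so the origin survives the concatenation
bat["source"] = "bat"
butterfly["source"] = "butterfly"

merged = pd.concat([bat, butterfly], ignore_index=True, sort=False)
print(merged.shape)  # 2 rows; union of the columns
```

Columns missing from one frame (`site` for bats, `common_name` for butterflies) come through as NaN, which the cleaning steps can then fill.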

    Dataset 3: Bioblitz 2016¶

    • Dataset Identifier: bioblitz-2016

    Dataset Link

    Summary: This dataset records a variety of living organisms spotted during the BioBlitz event in Melbourne in 2016. It catalogues diverse species ranging from molluscs to annelids and includes detailed taxonomic information.

    Data entries are detailed with the taxonomy from kingdom to species level where available, common names, and precise geo-coordinates of each sighting. Identification notes and resource names provide context about the sighting sources and identification methods.

    Note: Observations were gathered through community and expert contributions during the BioBlitz event, aimed at cataloguing as many species as possible within a short time frame.

    Note: This dataset serves as a valuable resource for ecological studies and environmental education, offering insights into the local biodiversity of Melbourne.

    In [87]:
    SENSOR_DATASET = 'bioblitz-2016'
    bio = fetch_data(BASE_URL, SENSOR_DATASET, API_KEY)
    bio.head()
    
    Out[87]:
    taxa kingdom phylum class order family genus species common_name identification_notes data_resource_name sighting_date latitude longitude location geopoint
    0 Mollusc ANIMALIA MOLLUSCA None None None None None None None Participate Melbourne 2016-03-02 -37.8298 144.9002 None {'lon': 144.9002, 'lat': -37.8298}
    1 Insect ANIMALIA ARTHROPODA INSECTA None None None None None Insect Bowerbird 2016-03-20 -37.8185 144.9748 None {'lon': 144.9748, 'lat': -37.8185}
    2 Annelid ANIMALIA ANNELIDA OLIGOCHAETA None None None None None Earthworm Handwritten 2016-03-04 -37.8060 144.9710 None {'lon': 144.971, 'lat': -37.806}
    3 Annelid ANIMALIA ANNELIDA OLIGOCHAETA None None None None None Freshwater Oligochaete Worm Handwritten 2016-03-04 -37.8290 144.9800 None {'lon': 144.98, 'lat': -37.829}
    4 Amphibian ANIMALIA CHORDATA AMPHIBIA ANURA HYLIDAE Litoria None None Tiny Frog Participate Melbourne 2016-03-16 -37.8204 145.2496 None {'lon': 145.2496, 'lat': -37.8204}
    In [18]:
    #View info of data
    bio.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1356 entries, 0 to 1355
    Data columns (total 16 columns):
     #   Column                Non-Null Count  Dtype  
    ---  ------                --------------  -----  
     0   taxa                  1353 non-null   object 
     1   kingdom               1353 non-null   object 
     2   phylum                1322 non-null   object 
     3   class                 1314 non-null   object 
     4   order                 1177 non-null   object 
     5   family                1139 non-null   object 
     6   genus                 953 non-null    object 
     7   species               846 non-null    object 
     8   common_name           802 non-null    object 
     9   identification_notes  588 non-null    object 
     10  data_resource_name    1356 non-null   object 
     11  sighting_date         1350 non-null   object 
     12  latitude              1356 non-null   float64
     13  longitude             1353 non-null   float64
     14  location              0 non-null      object 
     15  geopoint              1353 non-null   object 
    dtypes: float64(2), object(14)
    memory usage: 169.6+ KB
    
    In [19]:
    # Check missing values for dataset
    missing_values = bio.isnull().sum()
    missing_values # Number of missing values in each column
    
    Out[19]:
    taxa                       3
    kingdom                    3
    phylum                    34
    class                     42
    order                    179
    family                   217
    genus                    403
    species                  510
    common_name              554
    identification_notes     768
    data_resource_name         0
    sighting_date              6
    latitude                   0
    longitude                  3
    location                1356
    geopoint                   3
    dtype: int64
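Absolute counts are easier to judge as percentages of the dataset. A minimal sketch of the idea, using a small toy frame in place of `bio`:

```python
import pandas as pd

# Toy frame standing in for `bio`: fraction of missing values per column
df = pd.DataFrame({
    "taxa": ["Insect", None, "Mollusc", "Bird"],
    "species": [None, None, "repens", None],
})

# isnull().mean() gives the per-column missing fraction; scale to percent
missing_pct = df.isnull().mean().mul(100).round(1)
print(missing_pct)
```

Applied to `bio`, the same one-liner (`bio.isnull().mean().mul(100)`) makes it obvious that `location` is 100% missing while `taxa` and `kingdom` are nearly complete.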

    Cleaning the Bioblitz dataset¶

    In [88]:
    # Filling missing values for categorical (object) columns with 'Unknown'.
    # Note that the column names are lowercase ('taxa', 'kingdom', 'location', ...),
    # so this fill also covers the entirely empty 'location' column; that column
    # is dropped later, when the datasets are merged.
    categorical_columns = bio.select_dtypes(include=['object']).columns
    bio[categorical_columns] = bio[categorical_columns].fillna('Unknown')
    
    # Displaying the cleaned data information and the first few rows
    print(bio.info())
    print(bio.head())
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1356 entries, 0 to 1355
    Data columns (total 16 columns):
     #   Column                Non-Null Count  Dtype  
    ---  ------                --------------  -----  
     0   taxa                  1356 non-null   object 
     1   kingdom               1356 non-null   object 
     2   phylum                1356 non-null   object 
     3   class                 1356 non-null   object 
     4   order                 1356 non-null   object 
     5   family                1356 non-null   object 
     6   genus                 1356 non-null   object 
     7   species               1356 non-null   object 
     8   common_name           1356 non-null   object 
     9   identification_notes  1356 non-null   object 
     10  data_resource_name    1356 non-null   object 
     11  sighting_date         1356 non-null   object 
     12  latitude              1356 non-null   float64
     13  longitude             1353 non-null   float64
     14  location              1356 non-null   object 
     15  geopoint              1356 non-null   object 
    dtypes: float64(2), object(14)
    memory usage: 169.6+ KB
    None
            taxa   kingdom      phylum        class    order   family    genus  \
    0    Mollusc  ANIMALIA    MOLLUSCA      Unknown  Unknown  Unknown  Unknown   
    1     Insect  ANIMALIA  ARTHROPODA      INSECTA  Unknown  Unknown  Unknown   
    2    Annelid  ANIMALIA    ANNELIDA  OLIGOCHAETA  Unknown  Unknown  Unknown   
    3    Annelid  ANIMALIA    ANNELIDA  OLIGOCHAETA  Unknown  Unknown  Unknown   
    4  Amphibian  ANIMALIA    CHORDATA     AMPHIBIA    ANURA  HYLIDAE  Litoria   
    
       species common_name         identification_notes     data_resource_name  \
    0  Unknown     Unknown                      Unknown  Participate Melbourne   
    1  Unknown     Unknown                       Insect              Bowerbird   
    2  Unknown     Unknown                    Earthworm            Handwritten   
    3  Unknown     Unknown  Freshwater Oligochaete Worm            Handwritten   
    4  Unknown     Unknown                    Tiny Frog  Participate Melbourne   
    
      sighting_date  latitude  longitude location  \
    0    2016-03-02  -37.8298   144.9002  Unknown   
    1    2016-03-20  -37.8185   144.9748  Unknown   
    2    2016-03-04  -37.8060   144.9710  Unknown   
    3    2016-03-04  -37.8290   144.9800  Unknown   
    4    2016-03-16  -37.8204   145.2496  Unknown   
    
                                 geopoint  
    0  {'lon': 144.9002, 'lat': -37.8298}  
    1  {'lon': 144.9748, 'lat': -37.8185}  
    2    {'lon': 144.971, 'lat': -37.806}  
    3     {'lon': 144.98, 'lat': -37.829}  
    4  {'lon': 145.2496, 'lat': -37.8204}  
    

    Merging the Bat and Bioblitz datasets¶

    In [89]:
    # Merge the two datasets on shared columns
    merged = pd.merge(bio, bat, on=['taxa', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'common_name'],
                      how='outer')
    
    # Dropping specified columns
    columns_to_drop = ['identification_notes', 'data_resource_name', 'sighting_date', 
                       'latitude', 'longitude', 'location', 'park_name']
    merged_cleaned = merged.drop(columns=columns_to_drop)
    
    # Combining geopoint columns
    merged_cleaned['geopoint'] = merged_cleaned['geopoint_x'].combine_first(merged_cleaned['geopoint_y'])
    merged_cleaned = merged_cleaned.drop(columns=['geopoint_x', 'geopoint_y'])
    
    # Save the cleaned merged dataset
    merged_cleaned.to_csv('/Users/francisrusli/desktop/merged.csv', index=False)
    merged_cleaned.head()
    
    Out[89]:
    taxa kingdom phylum class order family genus species common_name geopoint
    0 Mollusc ANIMALIA MOLLUSCA Unknown Unknown Unknown Unknown Unknown Unknown {'lon': 144.9002, 'lat': -37.8298}
    1 Mollusc ANIMALIA MOLLUSCA Unknown Unknown Unknown Unknown Unknown Unknown {'lon': 133.7751, 'lat': -25.2744}
    2 Mollusc ANIMALIA MOLLUSCA Unknown Unknown Unknown Unknown Unknown Unknown {'lon': 144.9376, 'lat': -37.7729}
    3 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown Unknown {'lon': 144.9748, 'lat': -37.8185}
    4 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown Unknown {'lon': 144.9634, 'lat': -37.7908}
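An outer merge on taxonomy columns is many-to-many: every bio row in a taxonomic group pairs with every bat row in the same group, which is why single sightings can appear multiple times in the merged output above. pandas' `indicator=True` flag is a useful sanity check here, showing whether each merged row came from the left frame, the right frame, or both. A minimal sketch with toy frames:

```python
import pandas as pd

# Toy stand-ins for the bio and bat frames
left = pd.DataFrame({"taxa": ["Insect", "Mammal"], "common_name": ["Ant", "Bat"]})
right = pd.DataFrame({"taxa": ["Mammal", "Bird"], "common_name": ["Bat", "Gull"]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
merged = pd.merge(left, right, on=["taxa", "common_name"],
                  how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

Running the same check on the real merge before dropping columns shows how many records each source contributed, and whether the row count inflated unexpectedly.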

    Part 2 (Analysis)¶

    Removing unknown classes and showing unique values¶

    In [90]:
    # Assuming 'Unknown' or NaN are the placeholders for unknown values in the 'class' column
    merged_cleaned = merged_cleaned[merged_cleaned['class'].notna()]
    merged_cleaned = merged_cleaned[merged_cleaned['class'] != 'Unknown']
    
    # You can check the effect by looking at the unique values in the 'class' column again
    print(merged_cleaned['class'].unique())
    
    ['INSECTA' 'OLIGOCHAETA' 'AMPHIBIA' 'ARACHNIDA' 'EQUISETOPSIDA'
     'POLYCHAETA' 'CLITELLATA' 'NEMERTINEA' 'CHONDRICHTHYES' 'AVES'
     'AGARICOMYCETES' 'ANTHOZOA' 'MALACOSTRACA' 'MAXILLOPODA' 'DIPLOPOD'
     'ASTEROIDEA' 'ACTINOPTERYGII' 'LECANOROMYCETES' 'BIVALVIA' 'MAMMALIA'
     'GASTROPODA' 'BRYOPSIDOPHYCEAE' 'FLORIDEOPHYCEAE' 'GINKGOOPSIDA'
     'REPTILIA' 'ASCIDIACEA' 'SCYPHOZOA' 'OSTROCODA' 'ECHINOIDEA' 'LILIOPSIDA'
     'CHILOPODA' 'ECHIURIDEA' 'ULVOPHYCEAE']
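The list still contains likely data-entry misspellings (e.g. 'OSTROCODA' for OSTRACODA, 'DIPLOPOD' for DIPLOPODA). A sketch of a replace-based normalisation; the corrections here are assumptions that should be verified against the source records before applying:

```python
import pandas as pd

# Hypothetical corrections -- verify against the original records first
corrections = {"OSTROCODA": "OSTRACODA", "DIPLOPOD": "DIPLOPODA"}

# Toy series standing in for merged_cleaned['class']
classes = pd.Series(["OSTROCODA", "AVES", "DIPLOPOD"])
cleaned = classes.replace(corrections)
print(sorted(cleaned.unique()))  # ['AVES', 'DIPLOPODA', 'OSTRACODA']
```

On the real frame, `merged_cleaned['class'] = merged_cleaned['class'].replace(corrections)` would fold the misspelt variants into their canonical names before counting.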
    

    Bar chart of class counts¶

    In [91]:
    import matplotlib.pyplot as plt
    
    # Count occurrences of each class in the merged dataset
    class_counts = merged_cleaned['class'].value_counts()
    
    plt.figure(figsize=(10, 5))
    class_counts.plot(kind='bar', color='skyblue')
    plt.title('Class Counts')
    plt.xlabel('Class')
    plt.ylabel('Counts')
    plt.xticks(rotation=90)
    plt.show()
    
    [Figure: bar chart of class counts]

    Pie chart of phylum distribution¶

    In [92]:
    # Calculate the value counts for the 'phylum' column
    phylum_counts = merged_cleaned['phylum'].value_counts()
    
    # Define a threshold for the 'Other' category; adjust as needed.
    threshold_percent = 5  # Phyla below this share are grouped into 'Other'
    other_threshold = phylum_counts.sum() * (threshold_percent / 100)
    
    # Combine smaller categories into 'Other' (copy so we don't alias phylum_counts)
    other = phylum_counts[phylum_counts < other_threshold].sum()
    main_phylum_counts = phylum_counts[phylum_counts >= other_threshold].copy()
    main_phylum_counts['Other'] = other
    
    # Create a pie chart
    plt.figure(figsize=(8, 8))
    plt.pie(main_phylum_counts, labels=main_phylum_counts.index, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.title('Pie Chart of Phylum with "Other" Category')
    plt.show()
    
    [Figure: pie chart of phylum proportions with 'Other' category]

    Splitting the geopoint column into latitude and longitude¶

    In [26]:
    import ast
    
    # Function to safely extract latitude and longitude from a geopoint value
    def extract_lat_lon(geopoint):
        if isinstance(geopoint, str):
            try:
                # literal_eval parses the dict-like string without the risks of eval
                geopoint = ast.literal_eval(geopoint)
            except (ValueError, SyntaxError):
                return None, None  # Unparseable string
        # Check that the parsed value is a dict with 'lat' and 'lon' keys
        if isinstance(geopoint, dict) and 'lon' in geopoint and 'lat' in geopoint:
            return geopoint['lat'], geopoint['lon']
        return None, None  # Keys missing or not a dict
    
    # Apply the function to the 'geopoint' column
    merged_cleaned[['latitude', 'longitude']] = merged_cleaned['geopoint'].apply(extract_lat_lon).apply(pd.Series)
    
    # Show the head of the DataFrame to confirm the new columns
    print(merged_cleaned.head())
    
         taxa   kingdom      phylum    class    order   family    genus  species  \
    3  Insect  ANIMALIA  ARTHROPODA  INSECTA  Unknown  Unknown  Unknown  Unknown   
    4  Insect  ANIMALIA  ARTHROPODA  INSECTA  Unknown  Unknown  Unknown  Unknown   
    5  Insect  ANIMALIA  ARTHROPODA  INSECTA  Unknown  Unknown  Unknown  Unknown   
    6  Insect  ANIMALIA  ARTHROPODA  INSECTA  Unknown  Unknown  Unknown  Unknown   
    7  Insect  ANIMALIA  ARTHROPODA  INSECTA  Unknown  Unknown  Unknown  Unknown   
    
      common_name                            geopoint  latitude  longitude  
    3     Unknown  {'lon': 144.9748, 'lat': -37.8185}  -37.8185   144.9748  
    4     Unknown  {'lon': 144.9634, 'lat': -37.7908}  -37.7908   144.9634  
    5     Unknown  {'lon': 144.9706, 'lat': -37.8216}  -37.8216   144.9706  
    6     Unknown  {'lon': 144.9606, 'lat': -37.7986}  -37.7986   144.9606  
    7     Unknown  {'lon': 144.9564, 'lat': -37.7918}  -37.7918   144.9564  
    
    In [27]:
    # Drop the original 'geopoint' column
    merged_cleaned.drop(columns=['geopoint'], inplace=True)
    

    Taxa Distribution¶

    In [28]:
    # Plot a histogram for the distribution of different taxa
    plt.figure(figsize=(10, 6))
    merged_cleaned['taxa'].value_counts().plot(kind='bar', color='skyblue')
    plt.title('Distribution of Taxa in Melbourne\'s Parks and Green Spaces')
    plt.xlabel('Taxa')
    plt.ylabel('Frequency')
    plt.xticks(rotation=90)
    plt.grid(True)
    plt.show()
    
    [Figure: bar chart of taxa distribution]

    Inspect dataset for taxa¶

    In [29]:
    import folium
    
    # Assuming you've already cleaned and prepared 'merged_cleaned' DataFrame
    melbourne_coordinates = [-37.814, 144.96332]
    
    # Create a Folium map centered around Melbourne
    m = folium.Map(location=melbourne_coordinates, zoom_start=12)
    
    # Print unique taxa
    unique_taxa = merged_cleaned['taxa'].unique()
    print("Unique Taxa in the Dataset:")
    print(unique_taxa)
    
    # Define a color mapping for different taxa
    taxa_mapping = {
        'Mollusc': {'color': 'green', 'icon': 'glyphicon-leaf'},
        'Insect': {'color': 'red', 'icon': 'glyphicon-bug'},
        'Bird': {'color': 'blue', 'icon': 'glyphicon-bird'},
        'Mammal': {'color': 'gray', 'icon': 'glyphicon-knight'},
        # Add other taxa and customize icons as needed
    }
    
    # Loop through each row in the DataFrame to add markers, ensuring no NaN coordinates
    for _, row in merged_cleaned.dropna(subset=['latitude', 'longitude']).iterrows():
        lat, lng = row['latitude'], row['longitude']
        taxa = row['taxa']
        if taxa in taxa_mapping:
            marker_color = taxa_mapping[taxa]['color']
            marker_icon = taxa_mapping[taxa]['icon']
        else:
            marker_color = 'purple'  # default color
            marker_icon = 'glyphicon-question-sign'  # default icon
    
        # Add markers to the map
        folium.Marker(
            location=[lat, lng],
            popup=f"{taxa} - {row['common_name']}",
            icon=folium.Icon(color=marker_color, icon=marker_icon)
        ).add_to(m)
    
    # Display the map
    m
    
    Unique Taxa in the Dataset:
    ['Insect' 'Annelid' 'Amphibian' 'Arachnid' 'Plant' 'Stingray' 'Bird'
     'Fungi' 'Cnidaria' 'Crustacean' 'Diplopod' 'Echinoderm' 'Fish' 'Mollusc'
     'Lichen' 'Mammal' 'Reptile' 'Ascidian' 'Nematode' 'Chilopod' 'Echiura']
    
    Out[29]:
    [Interactive map: markers for all taxa]

    Focusing only on bats¶

    Chiroptera, from the Greek for "hand wing", is the order comprising bats, the only mammals capable of true flight. The name refers to their hand-like wings, which are formed from four elongated "fingers" covered by a cutaneous membrane.

    Vespertilionidae is a family of microbats within the order Chiroptera: flying, insect-eating mammals variously described as the common, vesper, or simple-nosed bats.

    In [30]:
    merged_cleaned
    
    Out[30]:
    taxa kingdom phylum class order family genus species common_name latitude longitude
    3 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown Unknown -37.8185 144.9748
    4 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown Unknown -37.7908 144.9634
    5 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown Unknown -37.8216 144.9706
    6 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown Unknown -37.7986 144.9606
    7 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown Unknown -37.7918 144.9564
    ... ... ... ... ... ... ... ... ... ... ... ...
    1359 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE Mormopterus None None -37.8135 144.9786
    1360 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA VESPERTILIONIDAE Myotis macropus Large-footed Myotis -37.8312 144.9804
    1361 Mammal ANIMALIA CHORDATA MAMMALIA VESPERTILIONIDAE MOLOSSIDAE Scotorepens None None -37.8312 144.9804
    1362 Mammal ANIMALIA CHORDATA MAMMALIA VESPERTILIONIDAE MOLOSSIDAE Scotorepens None None -37.8135 144.9786
    1363 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA VEPSERTILIONIDAE Nyctophilus None None -37.8312 144.9804

    1322 rows × 11 columns

    In [31]:
    # Check for NaN values in latitude and longitude
    nan_latitude = merged_cleaned['latitude'].isna().sum()
    nan_longitude = merged_cleaned['longitude'].isna().sum()
    print(f"Number of NaN values in latitude: {nan_latitude}")
    print(f"Number of NaN values in longitude: {nan_longitude}")
    
    Number of NaN values in latitude: 0
    Number of NaN values in longitude: 0
    

    Mapping of bats based on their unique locations¶

    In [32]:
    # Filter for bat records: order 'CHIROPTERA', plus rows where the family name
    # 'VESPERTILIONIDAE' was entered in the 'order' column by mistake (see the
    # table above)
    bats_data = merged_cleaned[
        merged_cleaned['order'].str.upper().isin(['CHIROPTERA', 'VESPERTILIONIDAE'])
    ]
    
    # Summary of the filtered data
    print("Summary of Bat Observations:")
    print(f"Total records: {bats_data.shape[0]}")
    print(f"Unique species: {bats_data['species'].nunique()}")
    common_species = bats_data['species'].mode().values
    print(f"Common species: {common_species if common_species.size > 0 else 'None'}")
    print(f"Unique locations: {bats_data[['latitude', 'longitude']].dropna().drop_duplicates().shape[0]}")
    
    # Create a map centered around the approximate locations of the bat observations
    bat_map = folium.Map(location=[-37.8311, 144.9452], zoom_start=12)
    
    # Add markers for each bat observation
    for idx, row in bats_data.dropna(subset=['latitude', 'longitude']).iterrows():
        species_info = f"{row['genus']} {row['species']} - {row['common_name']}"
        folium.Marker(
            location=[row['latitude'], row['longitude']],
            popup=species_info,
            icon=folium.Icon(color='red', icon='glyphicon-tint')
        ).add_to(bat_map)
    
    # Display the map directly in Jupyter (optional)
    bat_map
    
    Summary of Bat Observations:
    Total records: 17
    Unique species: 5
    Common species: ['poliocephalus']
    Unique locations: 7
    
    Out[32]:
    [Interactive map: bat observation markers]

    Pie chart of bat species distribution¶

    The pie chart of bat species distribution visualizes the relative frequencies of different bat species in the dataset. Each slice of the pie chart represents a specific species, with the size of the slice corresponding to the proportion of observations of that species in comparison to the total observations across all species.

    • Quantitative Comparison: This chart provides a quick and easy way to compare how common each bat species is within the studied area. Larger slices indicate more commonly observed species, while smaller slices represent less frequent ones.
    • Diversity Insight: It helps in understanding the biodiversity of bats in the region by showing the variety of species and their relative abundance.
    • Conservation Priorities: For conservation efforts, knowing which species are more or less common can help prioritize actions, especially if some of the less common species are also known to be at risk.
    In [33]:
    # Ensure the species column does not have too many unique categories
    species_counts = bats_data['species'].value_counts()
    top_species = species_counts.head(10)  # You can adjust to include more species if needed
    
    # Create a pie chart
    plt.figure(figsize=(10, 7))
    plt.pie(top_species, labels=top_species.index, autopct='%1.1f%%', startangle=140)
    plt.title('Distribution of Bat Species')
    plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    plt.show()
    
    [Figure: pie chart of bat species distribution]

    Species richness and observations per species¶

    Species richness refers to the number of different species present in a given ecological community, region, or habitat. It is a measure of biodiversity that does not account for the abundance of species, only their presence.
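Richness can be contrasted with abundance-weighted measures such as the Shannon index, H = −Σ pᵢ ln pᵢ, which does account for how observations are spread across species. A small sketch with hypothetical per-species counts:

```python
import numpy as np

counts = np.array([10, 5, 2, 2, 1])  # hypothetical observations per species
richness = len(counts)               # number of species present: 5
p = counts / counts.sum()            # relative abundance of each species
shannon = -(p * np.log(p)).sum()     # Shannon diversity index

print(richness)
print(round(shannon, 3))
```

Two communities can share the same richness yet differ in Shannon diversity if one is dominated by a single species, which is why both measures are worth reporting.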

    In [34]:
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Calculate species richness (number of unique species)
    species_richness = bats_data['species'].nunique()
    
    # Calculate the distribution of observations per species
    species_distribution = bats_data['species'].value_counts()
    
    # Plot the distribution of observations per species
    plt.figure(figsize=(12, 6))
    sns.barplot(x=species_distribution.index, y=species_distribution.values, palette="viridis")
    plt.title('Distribution of Bat Observations by Species')
    plt.xlabel('Species')
    plt.ylabel('Number of Observations')
    plt.xticks(rotation=45)
    plt.show()
    
    print(f"Species Richness: {species_richness}")
    
    [Figure: bar plot of bat observations by species]
    Species Richness: 5
    

    Heatmap¶

    The intensity of colors illustrates the density of bat observations across the map. By displaying data on a color gradient, heatmaps allow researchers and conservationists to quickly identify hotspots of bat activity, understand habitat preferences, and discern spatial patterns that may influence bat behavior.

    In [35]:
    from folium.plugins import HeatMap
    
    # Create a map centered around the approximate locations of the bat observations
    heat_map = folium.Map(location=[-37.8311, 144.9452], zoom_start=12)
    
    # Add a heat map layer
    heat_data = [[row['latitude'], row['longitude']] for index, row in bats_data.dropna(subset=['latitude', 'longitude']).iterrows()]
    HeatMap(heat_data).add_to(heat_map)
    
    # Display the map directly in Jupyter (optional)
    heat_map
    
    Out[35]:
    [Interactive map: heatmap of bat observations]

    Clustering the Bat data¶

    In [36]:
    from folium.plugins import MarkerCluster
    
    # Create a map centered around the average coordinates
    bat_map = folium.Map(location=[bats_data['latitude'].mean(), bats_data['longitude'].mean()], zoom_start=12)
    
    # Create a marker cluster
    marker_cluster = MarkerCluster().add_to(bat_map)
    
    # Add markers to the cluster instead of the map
    for idx, row in bats_data.dropna(subset=['latitude', 'longitude']).iterrows():
        species_info = f"{row['genus']} {row['species']} - {row['common_name']}"
        folium.Marker(
            location=[row['latitude'], row['longitude']],
            popup=species_info,
            icon=folium.Icon(color='red', icon='glyphicon-tint')
        ).add_to(marker_cluster)
    
    # display the map
    bat_map 
    
    Out[36]:
    [Interactive map: clustered bat markers]

    Butterfly Biodiversity¶

    In [7]:
    butterfly
    
    Out[7]:
    site sloc walk sighting_date time vegwalktime vegspecies vegfamily latitude longitude ... tabe brow csem aand jvil paur ogyr gmac datetime geopoint
    0 Womens Peace Gardens 2 1 2017-02-26 11:42:00 1.3128 Schinus molle Anacardiaceae -37.7912 144.9244 ... 0 0 0 0 0 0 0 0 2017-02-26 11:42:00+00:00 -37.7912, 144.9244
    1 Argyle Square 1 1 2017-11-02 10:30:00 0.3051 Rosmarinus officinalis Lamiaceae -37.8023 144.9665 ... 0 0 0 0 0 0 0 0 2017-02-11 10:30:00+00:00 -37.8023, 144.9665
    2 Argyle Square 2 1 2017-12-01 10:35:00 0.3620 Euphorbia sp. Euphorbiaceae -37.8026 144.9665 ... 0 0 0 0 0 0 0 0 2017-01-12 10:35:00+00:00 -37.8026, 144.9665
    3 Westgate Park 4 1 2017-03-03 11:44:00 3.1585 Melaleuca lanceolata Myrtaceae -37.8316 144.9089 ... 0 0 0 0 0 0 0 0 2017-03-03 11:44:00+00:00 -37.8316, 144.9089
    4 Argyle Square 1 3 2017-01-15 12:33:00 0.4432 Mentha sp. Lamiaceae -37.8027 144.9662 ... 0 0 0 0 0 0 0 0 2017-01-15 12:33:00+00:00 -37.8027, 144.9662
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    4051 Fitzroy-Treasury Gardens 3 2 2017-06-02 17:44:00 0.5132 Tagetes sp. Asteraceae -37.8136 144.9819 ... 0 0 0 0 0 0 0 0 2017-02-06 17:44:00+00:00 -37.8136, 144.9819
    4052 Westgate Park 4 2 2017-02-02 13:57:00 2.1947 Myoporum parvifolium Scrophulariaceae -37.8311 144.9092 ... 0 0 0 0 0 0 0 0 2017-02-02 13:57:00+00:00 -37.8311, 144.9092
    4053 Westgate Park 5 3 2017-06-03 15:43:00 4.2408 Cassinia arcuata Asteraceae -37.8299 144.9106 ... 0 0 0 0 0 0 0 0 2017-03-06 15:43:00+00:00 -37.8299, 144.9106
    4054 Westgate Park 4 1 2017-02-02 11:05:00 1.5948 Xerochrysum viscosum Asteraceae -37.8316 144.9093 ... 0 0 0 0 0 0 0 0 2017-02-02 11:05:00+00:00 -37.8316, 144.9093
    4055 Carlton Gardens South 3 1 2017-01-30 12:42:00 1.4437 Asteraceae 1 Asteraceae -37.8044 144.9704 ... 0 0 0 0 0 0 0 0 2017-01-30 12:42:00+00:00 -37.8044, 144.9704

    4056 rows × 42 columns

    Number of unique families and species¶

    In [8]:
    # Calculate the number of unique species
    unique_species = butterfly['vegspecies'].nunique()
    
    # Calculate the number of unique families
    unique_families = butterfly['vegfamily'].nunique()
    
    print(f"Number of unique vegetation species: {unique_species}")
    print(f"Number of unique vegetation families: {unique_families}")
    
    Number of unique vegetation species: 134
    Number of unique vegetation families: 59
    

    Summary of the number of sightings at each site¶

    In [9]:
    # Count the number of sightings at each site
    sightings_per_site = butterfly['site'].value_counts()
    
    # Bar plot of the number of sightings per site
    plt.figure(figsize=(12, 8))
    sightings_per_site.plot(kind='bar')
    plt.title('Number of Sightings per Site')
    plt.xlabel('Site')
    plt.ylabel('Sightings Count')
    plt.xticks(rotation=45, ha='right')  # Rotate the x labels for better readability
    plt.tight_layout()  # Adjust layout
    plt.show()
    
    [Figure: bar plot of sightings per site]

    Average temperature and humidity of each site¶

    In [40]:
    # We first group the data by 'site' and calculate the mean for the 'temp' and 'hum' columns
    site_comparison = butterfly.groupby('site').agg({'temp':'mean', 'hum':'mean'}).reset_index()
    
    # Now let's create a bar plot for average temperature by site
    plt.figure(figsize=(12, 8))
    sns.barplot(x='site', y='temp', data=site_comparison, palette='coolwarm')
    plt.title('Average Temperature by Site')
    plt.xlabel('Site')
    plt.ylabel('Average Temperature (°C)')
    plt.xticks(rotation=45, ha='right')  # Rotate the x labels for better readability
    plt.tight_layout()  # Adjust layout
    plt.show()
    
    # And a bar plot for average humidity by site
    plt.figure(figsize=(12, 8))
    sns.barplot(x='site', y='hum', data=site_comparison, palette='coolwarm')
    plt.title('Average Humidity by Site')
    plt.xlabel('Site')
    plt.ylabel('Average Humidity (%)')
    plt.xticks(rotation=45, ha='right')  # Rotate the x labels for better readability
    plt.tight_layout()  # Adjust layout
    plt.show()
    
    [Figure: average temperature by site]
    [Figure: average humidity by site]

    Top 10 vegetation families¶

    In [41]:
    # Count occurrences of each vegetation family
    vegfamily_counts = butterfly['vegfamily'].value_counts()
    
    # Number of unique families
    unique_families = len(vegfamily_counts)
    
    # Print the number of unique families
    print(f'Number of unique vegetation families: {unique_families}')
    
    # Print the top 10 families
    print('Top 10 vegetation families by occurrence:')
    print(vegfamily_counts.head(10))
    
    Number of unique vegetation families: 59
    Top 10 vegetation families by occurrence:
    vegfamily
    Asteraceae        644
    Fabaceae          544
    Lamiaceae         444
    Myrtaceae         152
    Plumbaginaceae    148
    Anacardiaceae     140
    Goodeniaceae      124
    Campanulaceae     116
    Brassicaceae      112
    Pittosporaceae    112
    Name: count, dtype: int64
    

    Vegetation family distribution¶

    In [42]:
    # Count occurrences of each vegetation family
    vegfamily_counts = butterfly['vegfamily'].value_counts()
    
    # Bar plot
    plt.figure(figsize=(12, 8))
    vegfamily_counts.plot(kind='bar', color='cadetblue')
    plt.title('Distribution of Vegetation Families')
    plt.xlabel('Vegetation Family')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45, ha='right')  # Rotate labels to improve readability
    plt.tight_layout()  # Adjust layout to make room for label rotation
    plt.show()
    
    [Figure: bar chart of vegetation family distribution]

    Unique vegetation species count¶

    In [43]:
    # Count occurrences of each vegetation species
    vegspecies_counts = butterfly['vegspecies'].value_counts()
    
    # Number of unique species
    unique_species = len(vegspecies_counts)
    
    # Print the number of unique species
    print(f'Number of unique vegetation species: {unique_species}')
    
    # Print the top 10 species
    print('Top 10 vegetation species by occurrence:')
    print(vegspecies_counts.head(10))
    
    Number of unique vegetation species: 134
    Top 10 vegetation species by occurrence:
    vegspecies
    Trifolium repens         412
    Asteraceae 1             244
    Schinus molle            140
    Goodenia ovata           124
    Wahlenbergia sp.         116
    Bursaria spinosa         112
    Raphanus raphanistrum    112
    Galenia pubescens         96
    Salvia sp.                88
    Canna generalis           88
    Name: count, dtype: int64
    

    Temperature distribution¶

    In [44]:
    # Temperature histogram
    plt.figure(figsize=(8, 6))
    plt.hist(butterfly['temp'], bins=20, color='skyblue', edgecolor='black')
    plt.title('Temperature Distribution')
    plt.xlabel('Temperature (°C)')
    plt.ylabel('Frequency')
    plt.show()
    
    [Figure: temperature histogram]

    Humidity distribution¶

    In [45]:
    # Humidity histogram
    plt.figure(figsize=(8, 6))
    plt.hist(butterfly['hum'], bins=20, color='lightgreen', edgecolor='black')
    plt.title('Humidity Distribution')
    plt.xlabel('Humidity (%)')
    plt.ylabel('Frequency')
    plt.show()
    
    [Figure: humidity histogram]

    Butterfly sightings¶

    In [46]:
    # Convert the sighting_date to datetime format
    butterfly['sighting_date'] = pd.to_datetime(butterfly['sighting_date'])
    
    # Sum sightings across a sample of the butterfly species-code columns
    # ('blue', 'dpet', 'dple'); extend the list to cover all code columns as needed
    butterfly['total_sightings'] = butterfly[['blue', 'dpet', 'dple']].sum(axis=1)
    
    # Group by date and sum sightings
    sightings_by_date = butterfly.groupby('sighting_date')['total_sightings'].sum()
    
    # Plotting
    plt.figure(figsize=(12, 6))
    plt.plot(sightings_by_date)
    plt.title('Butterfly Sightings Over Time')
    plt.xlabel('Date')
    plt.ylabel('Total Sightings')
    plt.grid(True)
    plt.show()
    
    [Figure: butterfly sightings over time]
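Rather than hard-coding three species-code columns, the per-sighting total could be derived from all numeric count columns at once. A sketch under the assumption that the code columns are the numeric ones (the column names below are illustrative):

```python
import pandas as pd

# Toy frame with a descriptive column plus species-code count columns
df = pd.DataFrame({
    "site": ["Argyle Square", "Westgate Park"],
    "blue": [1, 0], "dpet": [0, 2], "dple": [1, 1],
})

code_cols = df.select_dtypes("number").columns      # all numeric count columns
df["total_sightings"] = df[code_cols].sum(axis=1)   # row-wise total
print(df["total_sightings"].tolist())  # [2, 3]
```

On the real dataset, columns like `temp`, `hum`, and the coordinates are also numeric, so they would need to be excluded from `code_cols` (e.g. with `.drop(columns=[...])`) before summing.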

    Temperature and humidity boxplot¶

    In [47]:
    # Creating box plots for Temperature grouped by Vegetation Family
    plt.figure(figsize=(12, 8))
    sns.boxplot(x='vegfamily', y='temp', data=butterfly)
    plt.title('Temperature Distribution by Vegetation Family')
    plt.xlabel('Vegetation Family')
    plt.ylabel('Temperature (°C)')
    plt.xticks(rotation=90)  # Rotate the x labels for better readability
    plt.tight_layout()  # Adjust layout
    plt.show()
    
    # Creating box plots for Humidity grouped by Vegetation Family
    plt.figure(figsize=(12, 8))
    sns.boxplot(x='vegfamily', y='hum', data=butterfly)
    plt.title('Humidity Distribution by Vegetation Family')
    plt.xlabel('Vegetation Family')
    plt.ylabel('Humidity (%)')
    plt.xticks(rotation=90)  # Rotate the x labels for better readability
    plt.tight_layout()  # Adjust layout
    plt.show()
    
    [Figure: temperature distribution by vegetation family]
    [Figure: humidity distribution by vegetation family]
    In [48]:
    plt.figure(figsize=(14, 10))  # Adjusted figure size
    sns.boxplot(y='vegfamily', x='temp', data=butterfly)
    plt.title('Temperature Distribution by Vegetation Family')
    plt.ylabel('Vegetation Family')
    plt.xlabel('Temperature (°C)')
    plt.tight_layout()  # Adjust layout
    plt.show()
    
    [Figure: horizontal boxplot of temperature by vegetation family]

    Spread and Distribution: Each box plot represents the spread and central tendency of temperatures observed for each vegetation family. The bottom and top of each box are the first and third quartiles, and the band inside the box is the median. The whiskers extend to show the range of the data, and points outside of these are considered outliers.

    Temperature Ranges: Temperature ranges vary across vegetation families. Some families show a wider range of temperatures at which butterflies were sighted, indicated by longer boxes and whiskers, while others show a narrower range, indicated by shorter boxes and whiskers.

    Outliers: There are a few outliers present in several vegetation families. Outliers are the individual points that occur far away from the general cluster of data points, indicated by the diamonds outside of the whiskers. These could represent days with unusually high or low temperatures for sightings associated with those vegetation families.
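The whisker rule can be made concrete: in seaborn/matplotlib box plots the whiskers extend to the last data point within 1.5 × IQR of the quartiles, and anything beyond is drawn as an outlier. A small numeric sketch with hypothetical temperatures:

```python
import numpy as np

temps = np.array([12, 14, 15, 15, 16, 17, 18, 30])  # hypothetical temperatures

q1, q3 = np.percentile(temps, [25, 75])  # first and third quartiles
iqr = q3 - q1                            # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker fences

outliers = temps[(temps < lower) | (temps > upper)]
print(outliers)  # only the 30 degree reading lies beyond the fences
```

The same fences explain which sighting-day temperatures appear as diamonds in the box plots above.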

    Temperature vs. humidity scatter plot¶

    In [49]:
    # Scatter plot
    plt.figure(figsize=(10, 6))
    plt.scatter(butterfly['temp'], butterfly['hum'], alpha=0.5)
    plt.title('Scatter Plot of Temperature vs. Humidity')
    plt.xlabel('Temperature (°C)')
    plt.ylabel('Humidity (%)')
    plt.grid(True)
    plt.show()
    
    [Figure: scatter plot of temperature vs. humidity]

    Mapping all data points¶

    In [50]:
    # Create a map centered around Melbourne
    all_points_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=12)
    
    # Add markers for each butterfly observation
    for idx, row in butterfly.dropna(subset=['latitude', 'longitude']).iterrows():
        species_info = f"{row['vegspecies']} - {row['vegfamily']}"  # Update with relevant information
        folium.Marker(
            location=[row['latitude'], row['longitude']],
            popup=species_info
        ).add_to(all_points_map)
    
    # Display the map
    all_points_map
    
    Out[50]:
[Interactive folium map of all butterfly observation points]

    Creating clusters¶

    In [51]:
    # Create a map centered around Melbourne
    cluster_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=12)
    
    # Create a MarkerCluster object
    marker_cluster = MarkerCluster().add_to(cluster_map)
    
    # Add clustered markers for each butterfly observation
    for idx, row in butterfly.dropna(subset=['latitude', 'longitude']).iterrows():
    species_info = f"{row['vegspecies']} - {row['vegfamily']}"  # Popup text: vegetation species and family
        folium.Marker(
            location=[row['latitude'], row['longitude']],
            popup=species_info
        ).add_to(marker_cluster)
    
    # Display the map
    cluster_map
    
    Out[51]:
[Interactive folium map with clustered markers]

    Heatmap¶

    In [52]:
    # Create a map centered around Melbourne
    heatmap_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=12)
    
    # Prepare data for HeatMap
    heatmap_data = butterfly[['latitude', 'longitude']].dropna().values.tolist()
    
    # Add HeatMap layer
    HeatMap(heatmap_data).add_to(heatmap_map)
    
    # Display the map
    heatmap_map
    
    Out[52]:
[Interactive folium heatmap of observation density]

    Temperature and humidity correlation matrix heatmap¶

    In [53]:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Selecting only numeric columns for correlation - adjust this list as necessary
    numeric_columns = ['temp', 'hum']  # Add other numeric columns as needed
    butterfly_numeric = butterfly[numeric_columns]
    
    # Calculating the correlation matrix
    correlation_matrix = butterfly_numeric.corr()
    
    # Creating the heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm',
                xticklabels=correlation_matrix.columns,
                yticklabels=correlation_matrix.columns)
    
    # Showing the plot
    plt.title('Correlation Heatmap of Butterfly Dataset Variables')
    plt.show()
    
[Figure: correlation heatmap of temperature and humidity]

    Temperature (temp) and Humidity (hum) Relationship: There is a negative correlation of -0.67 between temperature and humidity. This indicates a moderately strong inverse relationship, meaning that as temperature increases, humidity tends to decrease, and vice versa within the dataset's observations.

Strength of Correlation: At -0.67, the correlation is moderately strong but well short of -1, so the relationship is not perfectly linear and other factors likely influence the observed humidity and temperature values.
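Alongside the correlation coefficient, a significance test can confirm the relationship is unlikely to be chance. A sketch with `scipy.stats.pearsonr` on synthetic temperature/humidity pairs (SciPy assumed available; the real check would use the butterfly columns):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic temperature/humidity pairs with a built-in inverse relationship
rng = np.random.default_rng(0)
temp = rng.uniform(10, 35, size=200)
hum = 90 - 1.5 * temp + rng.normal(0, 8, size=200)  # humidity falls as temperature rises

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(temp, hum)
print(round(r, 2), p_value < 0.05)
```

A p-value below the usual 0.05 threshold indicates the negative correlation is statistically significant rather than an artefact of sampling.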

    Part 3: Predicting Presence of a Species¶

    In [12]:
    # Create a binary target variable indicating the presence of species in the Asteraceae family
    butterfly['target'] = (butterfly['vegfamily'] == 'Asteraceae').astype(int)
    
# Select a subset of potentially relevant features
# (.copy() avoids SettingWithCopyWarning when the columns are overwritten below)
features = butterfly[['site', 'sloc', 'walk', 'time', 'vegwalktime', 'latitude', 'longitude']].copy()
    
    # Convert categorical features to numerical codes using Label Encoder
    label_encoders = {}
    for column in features.select_dtypes(include=['object']).columns:
        le = LabelEncoder()
        features[column] = le.fit_transform(features[column])
        label_encoders[column] = le
    
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(features, butterfly['target'], test_size=0.2, random_state=42)
    
    # Print the shapes of the training and testing data
    print("Training set shape:", X_train.shape)
    print("Testing set shape:", X_test.shape)
    
    Training set shape: (3244, 7)
    Testing set shape: (812, 7)
    

    The dataset was split into training (3244 samples) and testing (812 samples) sets.
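Because the target turns out to be imbalanced, passing `stratify` to `train_test_split` keeps the positive share identical in both splits. A toy sketch (synthetic labels approximating the Asteraceae share, not the butterfly data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: ~18% positives, similar to the Asteraceae proportion
y = np.array([0] * 82 + [1] * 18)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the class ratio in both the training and testing sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(y_tr.mean(), y_te.mean())  # class ratios match closely across the splits
```

Without stratification, an unlucky random split can leave the test set with a noticeably different positive rate, distorting every metric computed on it.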

    The logistic regression model¶

    In [15]:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report
    
    # Impute missing values
    # For numerical features, use the mean
    # For categorical features, use the most frequent value
    numerical_imputer = SimpleImputer(strategy='mean')
    categorical_imputer = SimpleImputer(strategy='most_frequent')
    
    numerical_cols = X_train.select_dtypes(include=['float64', 'int64']).columns
    categorical_cols = X_train.select_dtypes(include=['object']).columns
    
    X_train[numerical_cols] = numerical_imputer.fit_transform(X_train[numerical_cols])
    X_test[numerical_cols] = numerical_imputer.transform(X_test[numerical_cols])
    
    if len(categorical_cols) > 0:
        X_train[categorical_cols] = categorical_imputer.fit_transform(X_train[categorical_cols])
        X_test[categorical_cols] = categorical_imputer.transform(X_test[categorical_cols])
    
    # Initialize and train the logistic regression model
    logreg_model = LogisticRegression(max_iter=1000)
    logreg_model.fit(X_train, y_train)
    
    # Predict on the testing set and evaluate the model
    y_pred = logreg_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    
    print("Accuracy:", accuracy)
    print("Classification Report:\n", class_report)
    
    Accuracy: 0.8165024630541872
    Classification Report:
                   precision    recall  f1-score   support
    
               0       0.82      1.00      0.90       663
               1       0.00      0.00      0.00       149
    
        accuracy                           0.82       812
       macro avg       0.41      0.50      0.45       812
    weighted avg       0.67      0.82      0.73       812
    
    

    Accuracy: 81.65%

    Class 0 (Negative)

    • Precision: 82% - Of all the predictions for class 0, 82% were correct.
    • Recall: 100% - Every actual class-0 instance was labelled class 0, but only because the model predicts class 0 for nearly everything.
    • F1-Score: 90% - High, yet driven entirely by the dominance of the majority class.

    Class 1 (Positive)

    • Precision: 0% - The model made no correct positive predictions; with zero predicted positives the score is undefined and reported as 0.
    • Recall: 0% - Not a single actual instance of class 1 was identified.
    • F1-Score: 0% - The model fails completely on the positive class.

    Classification Report: The report indicates a high accuracy, but a deeper look reveals some issues. Specifically, the model predicts all instances as the majority class (class 0, or 'no presence of Asteraceae species'). This is evident from the precision, recall, and F1-score for class 1 being 0, which indicates that the model fails to identify any positive cases of the Asteraceae presence correctly. This is a common issue in imbalanced datasets where one class significantly outnumbers the other. The model tends to favor the majority class at the expense of the minority class.
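In fact, the 81.65% headline accuracy is exactly the majority-class share: a constant predictor reproduces it. A sketch using the same class counts as the test split above:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The test split above had 663 negatives and 149 positives
y_test = np.array([0] * 663 + [1] * 149)

# A "classifier" that always predicts the majority class
y_pred = np.zeros_like(y_test)

baseline_acc = accuracy_score(y_test, y_pred)
print(round(baseline_acc, 4))  # 0.8165 — identical to the logistic regression's accuracy
```

Any model on an imbalanced problem should beat this majority-class baseline before its accuracy is taken seriously.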

    Confusion Matrix¶

    Visualizing the true positives, true negatives, false positives, and false negatives.

    In [16]:
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
    import matplotlib.pyplot as plt
    
    # Initialize and train the logistic regression model with class weight 'balanced'
    logreg_balanced = LogisticRegression(max_iter=1000, class_weight='balanced')
    logreg_balanced.fit(X_train, y_train)
    
    # Predict on the testing set
    y_pred_balanced = logreg_balanced.predict(X_test)
    y_pred_proba_balanced = logreg_balanced.predict_proba(X_test)[:, 1]  # probabilities for the positive class
    
    # Compute confusion matrix
    conf_matrix_balanced = confusion_matrix(y_test, y_pred_balanced)
    
    # Compute ROC curve and AUC
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba_balanced)
    roc_auc = auc(fpr, tpr)
    
    # Plotting the confusion matrix
    plt.figure(figsize=(6, 5))
    plt.imshow(conf_matrix_balanced, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title('Confusion Matrix - Balanced Classes')
    plt.colorbar()
    tick_marks = np.arange(2)
    plt.xticks(tick_marks, ['Negative', 'Positive'], rotation=45)
    plt.yticks(tick_marks, ['Negative', 'Positive'])
    for i in range(conf_matrix_balanced.shape[0]):
        for j in range(conf_matrix_balanced.shape[1]):
            plt.text(j, i, conf_matrix_balanced[i, j], horizontalalignment="center", color="white" if conf_matrix_balanced[i, j] > conf_matrix_balanced.max()/2 else "black")
    plt.tight_layout()
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.show()
    
[Figure: confusion matrix for the balanced-class logistic regression]

    Confusion Matrix:

    • True Negatives (Top-Left): 309 - The model correctly predicted 'no presence' of Asteraceae species.
    • False Positives (Top-Right): 354 - The model incorrectly predicted 'presence' when there was none.
    • False Negatives (Bottom-Left): 62 - The model failed to predict 'presence' when there was.
    • True Positives (Bottom-Right): 87 - The model correctly predicted 'presence' of Asteraceae species.
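These four counts determine the positive-class metrics directly; recomputing them by hand shows the balanced model trades precision for recall compared with the unweighted model:

```python
# Counts from the balanced-class confusion matrix above
tn, fp, fn, tp = 309, 354, 62, 87

precision = tp / (tp + fp)          # 87 / 441 — most positive predictions are wrong
recall = tp / (tp + fn)             # 87 / 149 — but most actual positives are now found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Recall for class 1 climbs from 0% to roughly 58%, at the cost of many false positives, which is the expected effect of `class_weight='balanced'`.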

    ROC Curve and AUC Score¶

    Evaluate the performance across different thresholds, showing the trade-off between sensitivity and specificity.

    In [17]:
    # Plotting the ROC curve
    plt.figure(figsize=(6, 5))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.tight_layout()
    plt.show()
    
[Figure: ROC curve for the balanced-class model, AUC = 0.56]

The AUC (Area Under the Curve) score is 0.56, only slightly better than random guessing, which suggests the model struggles to distinguish between the classes. The ROC curve accordingly sits close to the diagonal (random-guess) line.

As reference points: an AUC of 0.5 corresponds to a model no better than random guessing, while 1.0 corresponds to a perfect classifier that ranks every positive example above every negative one.
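The 0.5 baseline is easy to verify empirically: scoring random, uninformative predictions yields an AUC of about 0.5. A quick sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=10000)   # random binary labels
random_scores = rng.random(10000)          # scores carrying no information about y_true

auc_random = roc_auc_score(y_true, random_scores)
print(round(auc_random, 2))  # ≈ 0.5
```

Against this yardstick, 0.56 represents only a marginal amount of genuine discriminative signal.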

Implementing oversampling and undersampling¶

    Oversampling the Minority Class

    In [21]:
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from imblearn.over_sampling import RandomOverSampler
    
    # Initialize the RandomOverSampler object
    ros = RandomOverSampler(random_state=42)
    
    # Resample the dataset
    X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
    
    # Initialize and train the logistic regression model on the oversampled data
    logreg_ros = LogisticRegression(max_iter=1000)
    logreg_ros.fit(X_train_ros, y_train_ros)
    
    # Predict on the testing set
    y_pred_ros = logreg_ros.predict(X_test)
    y_pred_proba_ros = logreg_ros.predict_proba(X_test)[:, 1]
    
    # Evaluate the model
    print("Confusion Matrix for Oversampled Data:")
    print(confusion_matrix(y_test, y_pred_ros))
    print("ROC AUC for Oversampled Data:", roc_auc_score(y_test, y_pred_proba_ros))
    
    Confusion Matrix for Oversampled Data:
    [[312 351]
     [ 64  85]]
    ROC AUC for Oversampled Data: 0.556540840393979
    

    Undersampling the Majority Class

    In [22]:
    from imblearn.under_sampling import RandomUnderSampler
    
    # Initialize the RandomUnderSampler object
    rus = RandomUnderSampler(random_state=42)
    
    # Resample the dataset
    X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
    
    # Initialize and train the logistic regression model on the undersampled data
    logreg_rus = LogisticRegression(max_iter=1000)
    logreg_rus.fit(X_train_rus, y_train_rus)
    
    # Predict on the testing set
    y_pred_rus = logreg_rus.predict(X_test)
    y_pred_proba_rus = logreg_rus.predict_proba(X_test)[:, 1]
    
    # Evaluate the model
    print("Confusion Matrix for Undersampled Data:")
    print(confusion_matrix(y_test, y_pred_rus))
    print("ROC AUC for Undersampled Data:", roc_auc_score(y_test, y_pred_proba_rus))
    
    Confusion Matrix for Undersampled Data:
    [[346 317]
     [ 71  78]]
    ROC AUC for Undersampled Data: 0.5685363458754695
    

    ROC curves¶

    In [23]:
# Plotting ROC curves for all models (AUC values computed rather than hardcoded)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='Original Balanced (AUC = %0.2f)' % roc_auc)
fpr_ros, tpr_ros, _ = roc_curve(y_test, y_pred_proba_ros)
plt.plot(fpr_ros, tpr_ros, label='Oversampled (AUC = %0.2f)' % roc_auc_score(y_test, y_pred_proba_ros))
fpr_rus, tpr_rus, _ = roc_curve(y_test, y_pred_proba_rus)
plt.plot(fpr_rus, tpr_rus, label='Undersampled (AUC = %0.2f)' % roc_auc_score(y_test, y_pred_proba_rus))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves Comparison')
    plt.legend(loc='lower right')
    plt.show()
    
[Figure: ROC curves for the original, oversampled, and undersampled models]

    Random Forest¶

    In [24]:
    from sklearn.ensemble import RandomForestClassifier
    
    # Initialize the Random Forest model
    random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
    
    # Train the model
    random_forest.fit(X_train, y_train)
    
    # Predict on the testing set
    y_pred_rf = random_forest.predict(X_test)
    y_pred_proba_rf = random_forest.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
    roc_auc_rf = roc_auc_score(y_test, y_pred_proba_rf)
    
    print("Random Forest Confusion Matrix:\n", conf_matrix_rf)
    print("Random Forest ROC AUC:", roc_auc_rf)
    
    Random Forest Confusion Matrix:
     [[663   0]
     [  0 149]]
    Random Forest ROC AUC: 1.0000000000000002
    

The Random Forest model produced an ROC AUC of 1.0 and a confusion matrix with no false positives or false negatives. A perfect score on held-out data is a warning sign rather than a triumph: it typically indicates target leakage or memorisation. Here, the location features (site, sloc, latitude, longitude) may effectively identify individual survey locations whose vegetation family never varies, so the model can look up the answer rather than generalise.
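A quick way to probe for this kind of leakage is to check whether the target is constant within each site. The sketch below uses a hypothetical table; the real check would run the same `groupby` on the butterfly dataframe's site and target columns:

```python
import pandas as pd

# Hypothetical sightings table: within each site the target never varies,
# so a model that memorises `site` scores perfectly — a classic leakage signal
df = pd.DataFrame({
    'site': ['A', 'A', 'B', 'B', 'C', 'C'],
    'target': [1, 1, 0, 0, 1, 1],
})

# Fraction of sites whose target takes exactly one value
purity = (df.groupby('site')['target'].nunique() == 1).mean()
print(purity)  # 1.0 → the target is fully determined by site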

    Gradient Boosting¶

    In [25]:
    from sklearn.ensemble import GradientBoostingClassifier
    
    # Initialize the Gradient Boosting model
    gradient_boosting = GradientBoostingClassifier(n_estimators=100, random_state=42)
    
    # Train the model
    gradient_boosting.fit(X_train, y_train)
    
    # Predict on the testing set
    y_pred_gb = gradient_boosting.predict(X_test)
    y_pred_proba_gb = gradient_boosting.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)
    roc_auc_gb = roc_auc_score(y_test, y_pred_proba_gb)
    
    print("Gradient Boosting Confusion Matrix:\n", conf_matrix_gb)
    print("Gradient Boosting ROC AUC:", roc_auc_gb)
    
    Gradient Boosting Confusion Matrix:
     [[663   0]
     [112  37]]
    Gradient Boosting ROC AUC: 0.8727464140018424
    

    Plotting ROC curves for both models¶

    In [26]:
    plt.figure(figsize=(8, 6))
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_proba_rf)
fpr_gb, tpr_gb, _ = roc_curve(y_test, y_pred_proba_gb)
plt.plot(fpr_rf, tpr_rf, label='Random Forest (AUC = %0.2f)' % roc_auc_rf)
plt.plot(fpr_gb, tpr_gb, label='Gradient Boosting (AUC = %0.2f)' % roc_auc_gb)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves Comparison')
    plt.legend(loc='lower right')
    plt.show()
    
[Figure: ROC curves for Random Forest and Gradient Boosting]

    Feature importance¶

    In [28]:
    # Get feature importances from both models
    importances_rf = random_forest.feature_importances_
    importances_gb = gradient_boosting.feature_importances_
    
    # Summarize feature importances in a DataFrame
    feature_names = X_train.columns
    importances_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance_RF': importances_rf,
        'Importance_GB': importances_gb
    }).sort_values(by='Importance_GB', ascending=False)
    
    # Plotting feature importances
    fig, ax = plt.subplots(2, 1, figsize=(12, 12))
    importances_df.plot(kind='barh', x='Feature', y='Importance_RF', ax=ax[0], color='blue', title='Random Forest Feature Importance')
    importances_df.plot(kind='barh', x='Feature', y='Importance_GB', ax=ax[1], color='green', title='Gradient Boosting Feature Importance')
    plt.tight_layout()
    plt.show()
    
[Figure: feature importance bar charts for Random Forest and Gradient Boosting]

    Random Forest: The site feature is the most significant, followed by latitude and sloc. This indicates that the location-related features play a crucial role in the Random Forest model's predictions. time seems to be the least important, which suggests that the timing of the observations isn't as critical to the model.

    Gradient Boosting: The site feature also leads in importance, similar to Random Forest, reinforcing the importance of location-related features in determining the presence of the species. walk and sloc also show considerable influence, which might indicate that specific conditions or characteristics captured by these features significantly impact the model.
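One caveat: impurity-based importances from tree ensembles can be biased toward high-cardinality features such as site identifiers. Scikit-learn's `permutation_importance` offers a model-agnostic cross-check by measuring how much the score drops when a column is shuffled. A sketch on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: only the first of three features determines the label
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Permutation importance: accuracy drop when each column is shuffled in turn
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean.round(2))  # first feature dominates
```

On the butterfly features, agreement between permutation and impurity rankings would strengthen the conclusion that location drives the predictions.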

    In [31]:
    import shap
    
    # Create a SHAP explainer object for Gradient Boosting model
    explainer = shap.TreeExplainer(gradient_boosting)
    shap_values = explainer.shap_values(X_test)
    
# Summary bar plot: mean |SHAP value| per feature across the test set
shap.summary_plot(shap_values, X_test, plot_type="bar")
    
[Figure: SHAP summary bar plot for the Gradient Boosting model]

    The significant features across both models are primarily related to location (site, latitude, longitude), which might be due to ecological factors specific to certain locations influencing the presence of the species.
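Given how dominant the location features are, a random train/test split can overstate performance for genuinely new sites. Scikit-learn's GroupKFold holds out whole groups (here, sites) so group identity can never leak from training into testing. A minimal sketch of the split behaviour:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: two observations per site, four sites
sites = np.array(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'])
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# Each fold holds out entire sites, so no site appears in both train and test
gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=sites):
    overlaps.append(set(sites[train_idx]) & set(sites[test_idx]))

print(overlaps)  # all empty sets — no site overlap in any fold
```

Re-running the cross-validation below with `cv=GroupKFold(...)` and `groups=butterfly['site']` would give a more honest estimate of how the models generalise to unsurveyed locations.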

    Cross validation¶

    In [32]:
    from sklearn.model_selection import cross_val_score
    
    # Set up k-fold cross-validation
    k = 5  # Number of folds
    
    # Random Forest cross-validation for accuracy
    rf_cv_accuracy = cross_val_score(random_forest, X_train, y_train, cv=k, scoring='accuracy')
    
    # Gradient Boosting cross-validation for accuracy
    gb_cv_accuracy = cross_val_score(gradient_boosting, X_train, y_train, cv=k, scoring='accuracy')
    
    print("Random Forest Average CV Accuracy:", np.mean(rf_cv_accuracy))
    print("Gradient Boosting Average CV Accuracy:", np.mean(gb_cv_accuracy))
    
    Random Forest Average CV Accuracy: 0.9916733245829292
    Gradient Boosting Average CV Accuracy: 0.8800852213281593
    
    In [33]:
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    
    # Parameter grid for Random Forest
    param_grid_rf = {
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
    
    # Setup the grid search
    grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
    grid_search_rf.fit(X_train, y_train)
    
    print("Best parameters for Random Forest:", grid_search_rf.best_params_)
    print("Best cross-validated accuracy for Random Forest:", grid_search_rf.best_score_)
    
    Best parameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
    Best cross-validated accuracy for Random Forest: 0.9916733245829292
    
    In [34]:
    from sklearn.ensemble import GradientBoostingClassifier
    
    # Parameter grid for Gradient Boosting
    param_grid_gb = {
        'n_estimators': [100, 200, 300],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 4, 5]
    }
    
    # Setup the grid search
    grid_search_gb = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid_gb, cv=5, scoring='accuracy', n_jobs=-1)
    grid_search_gb.fit(X_train, y_train)
    
    print("Best parameters for Gradient Boosting:", grid_search_gb.best_params_)
    print("Best cross-validated accuracy for Gradient Boosting:", grid_search_gb.best_score_)
    
    Best parameters for Gradient Boosting: {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 200}
    Best cross-validated accuracy for Gradient Boosting: 0.9919819665582379